ANOVA
Multifactor ANOVA
Nested design examples
Often consider more than 1 factor (independent categorical variable):
2-factor designs (2-way ANOVA) very common in ecology
Most multifactor designs: nested or factorial
Consider two factors: A and B
Nested Designs:
Factorial Designs:
Study on effects of enclosure size on limpet growth:
Study on reef fish recruitment:
5 sites (factor A)
6 transects at each site (factor B)
replicate observations along each transect
Effects of sea urchin grazing on biomass of filamentous algae:
Effects of light level on growth of seedlings of different size:
Effects of food level and tadpole presence on larval salamander growth
Effect of season and density on limpet fecundity.
Consider a nested design with:
Can calculate several means:
Where:
\(y_{ijk}\) is the response variable
value of the k-th replicate in j-th level of B in the i-th level of A
(algal biomass in 3rd quadrat, in 2nd patch in low grazing treatment)
\(\mu\) is the overall mean
The linear model for a nested design is:
\[y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \epsilon_{ijk}\]
\(\alpha_i\) is the fixed effect of the \(i\)-th level of factor A
(difference between average biomass in all low grazing level quadrats and overall mean)
\(\beta_{j(i)}\) is the random effect of factor \(j\) nested within factor \(i\)
usually a random variable, measuring the variance among all possible levels of B within each level of A
(variance among all possible patches that may have been used in the low grazing treatment)
As before, we partition the variance in the response variable using sums of squares (SS):
SSA is the SS of differences between the mean of each level of A and the overall mean
SSB(A) is the SS of differences between the mean of each level of B and the mean of the corresponding level of A, summed across levels of A
Two hypotheses tested on values of MS:
\(H_0\): no effect of A (all \(\alpha_i = 0\)), tested with \(F = MS_A / MS_{B(A)}\)
\(H_0\): no variance among levels of B within A (\(\sigma^2_{\beta} = 0\)), tested with \(F = MS_{B(A)} / MS_{Residual}\)
“significant variation between replicate patches within each treatment, but no significant difference in amount of filamentous algae between treatments”
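The SS partition and the two F ratios above can be sketched numerically. A minimal pure-Python example with hypothetical algal-biomass data (2 grazing levels, 2 patches per level, 3 quadrats per patch; all values invented):

```python
# Nested ANOVA SS partition, pure Python; all biomass values are invented.
# data[grazing level][patch j] = list of quadrat biomass values
data = {
    "low":  [[8.1, 7.9, 8.4], [6.2, 6.5, 6.0]],
    "high": [[3.1, 2.8, 3.3], [4.0, 4.4, 3.9]],
}
a, b, n = 2, 2, 3                      # levels of A, patches per level, replicates

all_vals = [y for patches in data.values() for patch in patches for y in patch]
grand = sum(all_vals) / len(all_vals)

ss_a = ss_b = ss_res = 0.0
for patches in data.values():
    level_vals = [y for patch in patches for y in patch]
    level_mean = sum(level_vals) / len(level_vals)
    ss_a += len(level_vals) * (level_mean - grand) ** 2
    for patch in patches:
        patch_mean = sum(patch) / len(patch)
        ss_b += len(patch) * (patch_mean - level_mean) ** 2
        ss_res += sum((y - patch_mean) ** 2 for y in patch)

ms_a, ms_b = ss_a / (a - 1), ss_b / (a * (b - 1))
ms_res = ss_res / (a * b * (n - 1))
f_a = ms_a / ms_b                      # H0: no effect of A, tested against MS_B(A)
f_b = ms_b / ms_res                    # H0: no variance among patches within levels
```

For a balanced design like this one, the three SS components add exactly to the total SS.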
Unequal sample sizes can be because of:
Not a problem, unless variances are unequal or there are large deviations from normality
As usual, we assume equal variance and normality; these need to be assessed at both levels:
Covered
Regression, t-test, ANOVA
Regression Assumptions
Model II Regression
Multiple Linear Regression model
What if more than one predictor (X) variable?
| Dependent variable | Independent variable: Continuous | Independent variable: Categorical |
|---|---|---|
| Continuous | Regression | ANOVA |
| Categorical | Logistic regression | Tabular analysis |
Abundance of C3 grasses can be modeled as a function of multiple predictors
Instead of a line, the model is a (hyper)plane
Used in similar way to simple linear regression:
Crawley 2012: “Multiple regression models provide some of the most profound challenges faced by the analyst”:
Multiple Regression:
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \epsilon_i\]
\(y_i\): value of \(Y\) for the \(i\)-th observation, where \(X_1 = x_{i1}, X_2 = x_{i2}, \dots, X_p = x_{ip}\)
\(\beta_0\): population intercept, the mean value of \(Y\) when \(X_1 = 0, X_2 = 0, \dots, X_p = 0\)
Multiple Regression:
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \epsilon_i\]
\(\beta_1\): partial population slope, the change in \(Y\) per unit change in \(X_1\), holding the other \(X\) variables constant
\(\beta_2\): partial population slope, the change in \(Y\) per unit change in \(X_2\), holding the other \(X\) variables constant
\(\beta_p\): partial population slope, the change in \(Y\) per unit change in \(X_p\), holding the other \(X\) variables constant
Multiple Regression:
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \epsilon_i\]
\(\epsilon_i\): unexplained error, the difference between \(y_i\) and the value predicted by the model (\(\hat{y}_i\))
\[\text{NPP} = \beta_0 + \beta_1(\text{lat}) + \beta_2(\text{long}) + \beta_3(\text{soil fertility}) + \epsilon_i\]
Multiple Regression:
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \epsilon_i\]
Regression equation can be used for prediction by substituting new values for the predictor (X) variables
Confidence intervals calculated for parameters
Confidence and prediction intervals depend on number of observations and number of predictors
Prediction should be restricted to within range of X variables
Variance: \(SS_{Total}\) is partitioned into \(SS_{Regression}\) and \(SS_{Residual}\)
SSregression is variance in Y explained by model
SSresidual is variance not explained by model
| Source of variation | SS | df | MS | Interpretation |
|---|---|---|---|---|
| Regression | \(\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2\) | \(p\) | \(\frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{p}\) | Difference between predicted values and the mean |
| Residual | \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\) | \(n-p-1\) | \(\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-p-1}\) | Difference between each observation and its predicted value |
| Total | \(\sum_{i=1}^{n} (y_i - \bar{y})^2\) | \(n-1\) | | Difference between each observation and the mean |
SS converted to non-additive MS (SS/df)
| Source of variation | SS | df | MS |
|---|---|---|---|
| Regression | \(\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2\) | \(p\) | \(\frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{p}\) |
| Residual | \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\) | \(n-p-1\) | \(\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-p-1}\) |
| Total | \(\sum_{i=1}^{n} (y_i - \bar{y})^2\) | \(n-1\) | |
Two \(H_0\)s usually tested in MLR:
Also: is any specific β = 0 (explanatory role)?
\[F_{w,\,n-p-1} = \frac{MS_{Extra}}{\text{Full } MS_{Residual}}\]
Can also use a t-test (R provides this value)
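As a small numeric sketch of this extra-SS F ratio (all SS values, \(n\), and \(w\) below are invented for illustration):

```python
# Extra-SS F test: does dropping w terms significantly worsen the fit?
# All numbers below are hypothetical.
ss_res_full, ss_res_reduced = 42.0, 60.0
n, p_full, w = 30, 3, 1                       # w = number of terms dropped

ms_extra = (ss_res_reduced - ss_res_full) / w # increase in residual SS per dropped term
ms_res_full = ss_res_full / (n - p_full - 1)  # full-model residual MS
F = ms_extra / ms_res_full                    # compare to F with (w, n - p - 1) df
```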
Explained variance (\(r^2\)) is calculated the same way as for simple regression:
\[r^2 = \frac{SS_{Regression}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}} \]
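To make the partition concrete, here is a minimal sketch that fits a two-predictor model by solving the normal equations and then computes \(SS_{Regression}\), \(SS_{Residual}\), and \(r^2\); the data are invented and no libraries are used:

```python
# OLS for y = b0 + b1*x1 + b2*x2 via the normal equations X'X b = X'y.
# All data values are hypothetical.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y  = [3.1, 3.9, 6.2, 6.8, 9.1, 9.8]

X = [[1.0, a, b] for a, b in zip(x1, x2)]     # design matrix with intercept column
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]

# Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting
A = [row[:] + [v] for row, v in zip(XtX, Xty)]
for c in range(3):
    p = max(range(c, 3), key=lambda r: abs(A[r][c]))
    A[c], A[p] = A[p], A[c]
    for r in range(3):
        if r != c:
            f = A[r][c] / A[c][c]
            A[r] = [u - f * v for u, v in zip(A[r], A[c])]
beta = [A[i][3] / A[i][i] for i in range(3)]  # [b0, b1, b2]

yhat = [sum(b * xi for b, xi in zip(beta, row)) for row in X]
ybar = sum(y) / len(y)
ss_tot = sum((yi - ybar) ** 2 for yi in y)
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
ss_reg = ss_tot - ss_res
r2 = ss_reg / ss_tot
```

With an intercept in the model, \(SS_{Regression}\) computed by subtraction equals \(\sum (\hat{y}_i - \bar{y})^2\) exactly.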
More observations than predictor variables
Regression of Y vs. each X separately does not consider the effect of the other predictors:
we want to know the shape of the relationship while holding the other predictors constant
Collinearity can be detected by:
Variance Inflation Factors (VIF): \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing \(X_j\) on the other predictors; values above ~10 indicate serious collinearity
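A sketch of the VIF computation for two predictors, using invented, nearly collinear data (with only two predictors, \(R_j^2\) is simply the squared correlation between them):

```python
# VIF for two predictors: VIF = 1 / (1 - r^2), r = correlation between x1 and x2.
# Data are hypothetical and deliberately near-collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.9]         # roughly 2 * x1
m1, m2 = sum(x1) / 5, sum(x2) / 5

cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
var1 = sum((a - m1) ** 2 for a in x1)
var2 = sum((b - m2) ** 2 for b in x2)
r2 = cov ** 2 / (var1 * var2)          # squared correlation
vif = 1 / (1 - r2)                     # VIF > ~10 flags serious collinearity
```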
Best/simplest solution:
Predictors can be modeled as:
\[y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \epsilon_i \quad \text{vs.} \quad y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \epsilon_i\]
“Curvature” of the regression (hyper)plane
Adding interactions:
Multiple Linear Regression accommodates both continuous and categorical variables (sex, vegetation type, etc.)
Categorical variables are coded as “dummy” variables; a factor with \(n\) categories needs \(n-1\) dummy variables
Sex M/F:
Fertility L/M/H:
| Fertility | fert1 | fert2 |
|---|---|---|
| Low | 0 | 0 |
| Med | 1 | 0 |
| High | 0 | 1 |
Coefficients interpreted relative to reference condition
| Fertility | fert1 | fert2 |
|---|---|---|
| Low | 0 | 0 |
| Med | 1 | 0 |
| High | 0 | 1 |
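A sketch of how the dummy coding in the table can be generated in code (hypothetical observations, with Low as the reference level):

```python
# Dummy coding for a 3-level fertility factor; Low is the reference level,
# so it gets (0, 0). Observation values below are invented.
def dummies(value):
    # returns (fert1, fert2): Low -> (0, 0), Med -> (1, 0), High -> (0, 1)
    return (1 if value == "Med" else 0, 1 if value == "High" else 0)

obs = ["Low", "High", "Med", "Low"]
coded = [dummies(v) for v in obs]
```

Each coefficient on fert1 or fert2 is then the difference from the Low (reference) condition.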
When we have multiple predictors (and interactions!)
To choose:
Overfitting
Need to account for increase in fit with added predictors:
\[\text{Adjusted } r^2 = 1 - \frac{SS_{\text{Residual}}/(n - (p + 1))}{SS_{\text{Total}}/(n - 1)}\]
\[\text{Akaike Information Criterion (AIC)} = n[\ln(SS_{\text{Residual}})] + 2(p + 1) - n\ln(n)\]
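A quick sketch applying both formulas to two hypothetical models (all SS values invented); note that a small drop in \(SS_{Residual}\) may not be worth an extra parameter:

```python
import math

# Hypothetical fit summaries: (SS_Residual, p) for two candidate models
n, ss_total = 30, 100.0
models = {"2 predictors": (40.0, 2), "3 predictors": (38.5, 3)}

results = {}
for name, (ss_res, p) in models.items():
    adj_r2 = 1 - (ss_res / (n - (p + 1))) / (ss_total / (n - 1))
    aic = n * math.log(ss_res) + 2 * (p + 1) - n * math.log(n)
    results[name] = (adj_r2, aic)      # lower AIC = preferred model
```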
But how to compare models?
Can fit all possible models
Automated forward (and backward) stepwise procedures: start with no terms (all terms), then add (remove) the term with the largest (smallest) contribution to model fit
We will use manual form of backward selection
Usually want to know relative importance of predictors to explaining Y
Using F-tests (or t-tests) on partial regression slopes:
Using coefficient of partial determination:
\[r_{X_j}^2 = \frac{SS_{\text{Extra}}}{\text{Reduced }SS_{\text{Residual}}}\]
\(SS_{Extra}\): the increase in \(SS_{Regression}\) when \(X_j\) is added to the model
Reduced \(SS_{Residual}\): the unexplained SS from the model without \(X_j\)
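A one-line numeric illustration of the partial \(r^2\) (SS values invented):

```python
# Coefficient of partial determination for X_j; all SS values are hypothetical.
ss_res_reduced, ss_res_full = 55.0, 41.25   # residual SS without / with X_j
ss_extra = ss_res_reduced - ss_res_full
partial_r2 = ss_extra / ss_res_reduced      # share of remaining variance X_j explains
```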
Using standardized partial regression slopes:
Using partial r2 values:
Results are easiest to report in tabular format